ShapeStacks: Learning Vision-Based Physical Intuition for Generalised Object Stacking
Physical intuition is pivotal for intelligent agents to perform complex
tasks. In this paper we investigate the passive acquisition of an intuitive
understanding of physical principles as well as the active utilisation of this
intuition in the context of generalised object stacking. To this end, we
provide a simulation-based dataset featuring 20,000 stack configurations composed of a variety of elementary geometric primitives, richly annotated with semantics and structural stability. We train visual classifiers for
binary stability prediction on the ShapeStacks data and scrutinise their
learned physical intuition. Due to the richness of the training data, our approach also generalises favourably to real-world scenarios, achieving
state-of-the-art stability prediction on a publicly available benchmark of
block towers. We then leverage the physical intuition learned by our model to
actively construct stable stacks and observe the emergence of an intuitive
notion of stackability - an inherent object affordance - induced by the active
stacking task. Our approach performs well even in challenging conditions where
it considerably exceeds the stack height observed during training or in cases
where initially unstable structures must be stabilised via counterbalancing.
Comment: revised version to appear at ECCV 2018
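A minimal sketch of the kind of binary stability classifier described above, in PyTorch; the backbone, input size, and training details are illustrative assumptions, not the paper's exact setup:

```python
# Minimal sketch of a binary stack-stability classifier in PyTorch.
# Backbone, input size and training details are illustrative assumptions,
# not the exact setup used in the ShapeStacks paper.
import torch
import torch.nn as nn
import torchvision.models as models

class StabilityClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # Any image backbone works; ResNet-18 is an arbitrary choice here.
        self.backbone = models.resnet18(weights=None)
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, 1)

    def forward(self, images):
        # images: (B, 3, H, W) renders of stacks -> one stability logit each
        return self.backbone(images).squeeze(-1)

model = StabilityClassifier()
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One illustrative training step on a dummy batch.
images = torch.randn(8, 3, 224, 224)        # placeholder stack renders
labels = torch.randint(0, 2, (8,)).float()  # 1 = stable, 0 = unstable
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```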
Visual7W: Grounded Question Answering in Images
We have seen great progress in basic perceptual tasks such as object
recognition and detection. However, AI models still fail to match humans in
high-level vision tasks due to the lack of capacities for deeper reasoning.
Recently the new task of visual question answering (QA) has been proposed to
evaluate a model's capacity for deep image understanding. Previous works have
established a loose, global association between QA sentences and images.
However, many questions and answers, in practice, relate to local regions in
the images. We establish a semantic link between textual descriptions and image
regions by object-level grounding. It enables a new type of QA with visual
answers, in addition to textual answers used in previous work. We study the
visual QA tasks in a grounded setting with a large collection of 7W
multiple-choice QA pairs. Furthermore, we evaluate human performance and
several baseline models on the QA tasks. Finally, we propose a novel LSTM model
with spatial attention to tackle the 7W QA tasks.
Comment: CVPR 2016
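The spatial attention mentioned above can be illustrated with a short hedged sketch: a head that scores each cell of a convolutional feature grid against the LSTM hidden state. Layer sizes and the scoring function are assumptions, not the exact Visual7W model:

```python
# Hedged sketch of spatial attention over convolutional image features;
# layer sizes and the additive scoring function are assumptions, not
# the exact Visual7W model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=512):
        super().__init__()
        self.score = nn.Linear(feat_dim + hidden_dim, 1)

    def forward(self, conv_feats, h):
        # conv_feats: (B, N, feat_dim) flattened spatial grid (N = H*W cells)
        # h:          (B, hidden_dim) current LSTM hidden state
        B, N, _ = conv_feats.shape
        h_tiled = h.unsqueeze(1).expand(B, N, h.size(-1))
        scores = self.score(torch.cat([conv_feats, h_tiled], dim=-1))
        alpha = F.softmax(scores, dim=1)        # attention over grid cells
        return (alpha * conv_feats).sum(dim=1)  # attended image vector

# Example: attend over a 14x14 grid of 512-d features.
att = SpatialAttention()
v = att(torch.randn(4, 196, 512), torch.randn(4, 512))
```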
Analyzing Modular CNN Architectures for Joint Depth Prediction and Semantic Segmentation
This paper addresses the task of designing a modular neural network
architecture that jointly solves different tasks. As an example we use the
tasks of depth estimation and semantic segmentation given a single RGB image.
The main focus of this work is to analyze the cross-modality influence between
depth and semantic prediction maps on their joint refinement. While most
previous works solely focus on measuring improvements in accuracy, we propose a
way to quantify the cross-modality influence. We show that there is a
relationship between final accuracy and cross-modality influence, although not
a simple linear one. Hence, a larger cross-modality influence does not necessarily translate into improved accuracy. We find that a beneficial balance between the cross-modality influences can be achieved through the network architecture, and we conjecture that this relationship can be utilized to understand different network design choices. Towards this end, we propose a Convolutional Neural Network (CNN) architecture that fuses state-of-the-art approaches for depth estimation and semantic labeling. By
balancing the cross-modality influences between depth and semantic prediction,
we achieve improved results for both tasks using the NYU-Depth v2 benchmark.
Comment: Accepted to ICRA 2017
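A hedged sketch of the kind of cross-modal refinement the abstract describes, with a crude probe for cross-modality influence; the module design is an assumption, not the paper's architecture:

```python
# Illustrative sketch of cross-modal refinement between depth and semantic
# predictions; the fusion module is an assumption, not the paper's design.
import torch
import torch.nn as nn

class CrossModalRefine(nn.Module):
    def __init__(self, n_classes=40):
        super().__init__()
        # Each head refines its own map given both modalities.
        self.depth_refine = nn.Conv2d(1 + n_classes, 1, 3, padding=1)
        self.sem_refine = nn.Conv2d(1 + n_classes, n_classes, 3, padding=1)

    def forward(self, depth, sem):
        # depth: (B, 1, H, W), sem: (B, n_classes, H, W)
        joint = torch.cat([depth, sem], dim=1)
        return self.depth_refine(joint), self.sem_refine(joint)

# A crude probe of cross-modality influence: perturb one modality and
# measure how much the other task's refined output changes.
refine = CrossModalRefine()
depth, sem = torch.randn(2, 1, 60, 80), torch.randn(2, 40, 60, 80)
d0, _ = refine(depth, sem)
d1, _ = refine(depth, torch.zeros_like(sem))
influence_of_sem_on_depth = (d0 - d1).abs().mean().item()
```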
Scrutinizing and De-Biasing Intuitive Physics with Neural Stethoscopes
Visually predicting the stability of block towers is a popular task in the
domain of intuitive physics. While previous work focusses on prediction
accuracy, a one-dimensional performance measure, we provide a broader analysis
of the learned physical understanding of the final model and how the learning
process can be guided. To this end, we introduce neural stethoscopes as a
general purpose framework for quantifying the degree of importance of specific
factors of influence in deep neural networks as well as for actively promoting
and suppressing information as appropriate. In doing so, we unify concepts from
multitask learning as well as training with auxiliary and adversarial losses.
We apply neural stethoscopes to analyse the state-of-the-art neural network for
stability prediction. We show that the baseline model is susceptible to being
misled by incorrect visual cues. This leads to a performance breakdown to the
level of random guessing when training on scenarios where visual cues are
inversely correlated with stability. Using stethoscopes to promote meaningful
feature extraction increases performance from 51% to 90% prediction accuracy.
Conversely, when trained on an easy dataset where visual cues are positively correlated with stability, the baseline model learns a bias that leads to poor performance on a harder dataset. Using an adversarial stethoscope, the network is successfully de-biased, leading to a performance increase from 66% to 88%.
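A stethoscope of this kind can be sketched as a small probe head attached to intermediate features. The gradient-reversal trick below is a common stand-in for the paper's weighted adversarial loss; sizes and names are assumptions:

```python
# Sketch of a stethoscope-style probe: a small head attached to intermediate
# features. With adversarial=True, a gradient-reversal layer pushes the main
# network to discard the probed factor while the probe still learns to
# predict it; this is a common stand-in for the paper's weighted loss.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad):
        return -grad  # negate gradients flowing into the main network

class Stethoscope(nn.Module):
    def __init__(self, feat_dim, n_factors, adversarial=False):
        super().__init__()
        self.probe = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                   nn.Linear(128, n_factors))
        self.adversarial = adversarial

    def forward(self, feats):
        if self.adversarial:
            feats = GradReverse.apply(feats)
        return self.probe(feats)

# Auxiliary mode promotes a factor; adversarial mode suppresses it.
steth = Stethoscope(feat_dim=256, n_factors=2, adversarial=True)
logits = steth(torch.randn(8, 256))
```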
Goal-Conditioned End-to-End Visuomotor Control for Versatile Skill Primitives
Visuomotor control (VMC) is an effective means of achieving basic
manipulation tasks such as pushing or pick-and-place from raw images.
Conditioning VMC on desired goal states is a promising way of achieving
versatile skill primitives. However, common conditioning schemes either rely on
task-specific fine-tuning - e.g. using one-shot imitation learning (IL) - or on sampling approaches using a forward model of scene dynamics, i.e. model-predictive control (MPC), leaving deployability and planning horizon severely limited. In this paper, we propose a conditioning scheme which avoids
these pitfalls by learning the controller and its conditioning in an end-to-end
manner. Our model predicts complex action sequences based directly on a dynamic
image representation of the robot motion and the distance to a given target
observation. In contrast to related works, this enables our approach to
efficiently perform complex manipulation tasks from raw image observations
without predefined control primitives or test time demonstrations. We report
significant improvements in task success over representative MPC and IL
baselines. We also demonstrate our model's generalisation capabilities in
challenging, unseen tasks featuring visual noise, cluttered scenes and unseen
object geometries.
Comment: revised manuscript with additional baselines and generalisation experiments; 11 pages, 8 figures, 7 tables
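The "dynamic image representation of the robot motion" can be illustrated with approximate rank pooling, a standard way to collapse a frame buffer into one motion-summary image; the paper's exact construction may differ:

```python
# Hedged sketch of a "dynamic image": a frame buffer collapsed into one
# motion-summary image via approximate rank pooling. The linear weights
# below are the common approximation; the paper's exact construction may
# differ.
import numpy as np

def dynamic_image(frames):
    # frames: (T, H, W, C) array of consecutive observations
    T = frames.shape[0]
    # alpha_t = 2t - T - 1 weights late frames positively and early frames
    # negatively, so the result encodes the direction of motion.
    alphas = 2.0 * np.arange(1, T + 1) - T - 1
    return np.tensordot(alphas, frames.astype(np.float32), axes=(0, 0))

# Example: 10 consecutive 64x64 RGB frames -> one (64, 64, 3) summary image.
buffer = np.random.rand(10, 64, 64, 3)
dimg = dynamic_image(buffer)
```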
RELATE: Physically Plausible Multi-Object Scene Synthesis Using Structured Latent Spaces
We present RELATE, a model that learns to generate physically plausible
scenes and videos of multiple interacting objects. Similar to other generative
approaches, RELATE is trained end-to-end on raw, unlabeled data. RELATE
combines an object-centric GAN formulation with a model that explicitly
accounts for correlations between individual objects. This allows the model to
generate realistic scenes and videos from a physically-interpretable
parameterization. Furthermore, we show that modeling the object correlation is
necessary to learn to disentangle object positions and identity. We find that
RELATE is also amenable to physically realistic scene editing and that it
significantly outperforms prior art in object-centric scene generation in both
synthetic (CLEVR, ShapeStacks) and real-world data (cars). In addition, in
contrast to state-of-the-art methods in object-centric generative modeling,
RELATE also extends naturally to dynamic scenes and generates videos of high
visual fidelity. Source code, datasets and more results are available at
http://geometry.cs.ucl.ac.uk/projects/2020/relate/
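The correlation modeling in RELATE can be illustrated with a toy position-correction module: object positions are first sampled independently, then shifted by a pairwise network that sees all other objects. Names and sizes here are hypothetical, not RELATE's actual architecture:

```python
# Toy sketch of correlation-aware object placement: positions are sampled
# independently, then shifted by a pairwise network that sees all other
# objects. Names and sizes are hypothetical, not RELATE's architecture.
import torch
import torch.nn as nn

class PositionCorrection(nn.Module):
    def __init__(self, appearance_dim=32):
        super().__init__()
        in_dim = 2 * (2 + appearance_dim)  # concatenated (pos, appearance) pair
        self.pairwise = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                      nn.Linear(64, 2))

    def forward(self, pos, app):
        # pos: (B, K, 2) independent positions, app: (B, K, appearance_dim)
        B, K, _ = pos.shape
        obj = torch.cat([pos, app], dim=-1)
        pairs = torch.cat([obj.unsqueeze(2).expand(B, K, K, -1),
                           obj.unsqueeze(1).expand(B, K, K, -1)], dim=-1)
        delta = self.pairwise(pairs).sum(dim=2)  # aggregate over other objects
        return pos + delta  # correlation-aware positions

corr = PositionCorrection()
new_pos = corr(torch.rand(2, 4, 2), torch.randn(2, 4, 32))
```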
Downramp-assisted underdense photocathode electron bunch generation in plasma wakefield accelerators
It is shown that the requirements for high-quality electron bunch generation
and trapping from an underdense photocathode in plasma wakefield accelerators
can be substantially relaxed by localizing the photocathode on a plasma density downramp. This depresses the phase velocity of the accelerating electric field
until the generated electrons are in phase, allowing for trapping in shallow
trapping potentials. As a consequence, the underdense photocathode technique becomes applicable at a much larger number of accelerator facilities. Furthermore, dark current generation is effectively suppressed.
Comment: 4 pages, 3 figures
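The phase-velocity argument can be made concrete with a textbook estimate (not taken from this paper), writing the wake phase as psi = k_p(z) * xi with xi = z - ct the distance behind the driver:

```latex
% Textbook estimate (not from this paper) of the wake phase velocity on a
% density gradient, for wake phase psi = k_p(z) * xi with xi = z - ct:
\[
  \frac{v_{\mathrm{ph}}}{c}
  = \left( 1 + \xi \, \frac{1}{k_p}\frac{\mathrm{d}k_p}{\mathrm{d}z} \right)^{-1}
  = \left( 1 + \frac{\xi}{2n}\frac{\mathrm{d}n}{\mathrm{d}z} \right)^{-1},
  \qquad k_p \propto \sqrt{n}.
\]
% Behind the driver xi < 0, and on a downramp dn/dz < 0, so the correction
% term is positive and v_ph < c: the wake slows until generated electrons
% catch up with the accelerating phase and are trapped.
```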
Unlocking the Power of Representations in Long-term Novelty-based Exploration
We introduce Robust Exploration via Clustering-based Online Density
Estimation (RECODE), a non-parametric method for novelty-based exploration that
estimates visitation counts for clusters of states based on their similarity in
a chosen embedding space. By adapting classical clustering to the nonstationary
setting of Deep RL, RECODE can efficiently track state visitation counts over
thousands of episodes. We further propose a novel generalization of the inverse
dynamics loss, which leverages masked transformer architectures for multi-step prediction and which, in conjunction with RECODE, achieves a new state-of-the-art in a suite of challenging 3D-exploration tasks in DM-Hard-8. RECODE also sets a new state-of-the-art in hard-exploration Atari games, and is the first agent to reach the end screen in "Pitfall!".
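A hedged sketch of clustering-based visitation counting of the kind the abstract describes: states are embedded, assigned to the nearest cluster within a radius, and rewarded with a 1/sqrt(count) novelty bonus. The assignment rule, fixed radius, and bonus form are simplified assumptions, not RECODE's actual estimator:

```python
# Hedged sketch of clustering-based visitation counting for exploration
# bonuses. The assignment rule, the fixed radius and the 1/sqrt(count)
# bonus are simplified assumptions, not RECODE's actual estimator.
import numpy as np

class ClusterCounts:
    def __init__(self, radius=1.0):
        self.radius = radius
        self.centers = []  # cluster centers in the embedding space
        self.counts = []   # visitation count per cluster

    def update_and_bonus(self, e):
        # e: embedding of the current state (1-D numpy array)
        if self.centers:
            d = np.linalg.norm(np.stack(self.centers) - e, axis=1)
            i = int(d.argmin())
            if d[i] < self.radius:
                self.counts[i] += 1
                # Drift the matched center toward the new sample.
                self.centers[i] += (e - self.centers[i]) / self.counts[i]
                return 1.0 / np.sqrt(self.counts[i])
        # No nearby cluster: open a new one -> maximal novelty bonus.
        self.centers.append(e.astype(np.float64).copy())
        self.counts.append(1)
        return 1.0

counter = ClusterCounts(radius=0.5)
bonus = counter.update_and_bonus(np.random.randn(16))
```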